Explore and Summarise Data

Varun Kumar Sharma

November 1st, 2018

Introduction

In this project, R and exploratory data analysis (EDA) techniques are used to explore the wineRedQuality dataset. The dataset comes from: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Data Analysis using R and EDA

Univariate Plots Section

Let’s see which variables are included in this dataset.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Let us now check the variable types:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

There are 1599 observations and 13 variables (including X). X appears to be an index value for each observation. Also, quality is stored as an integer, and all other variables are numeric.

Let us drop the X variable, which is used only for indexing:

Checking the variables after deleting X:

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
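As a minimal sketch of this step, the snippet below drops the index column from a toy data frame. The toy frame holds only the first five rows and three of the columns printed in the str() output above. The name wine_data matches the object that appears in later output of this report, but how the original data was loaded is not shown, so the construction here is an assumption.

```r
# Toy sample: first five rows of a few columns, copied from the str() output.
wine_data <- data.frame(
  X             = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality       = c(5L, 5L, 5L, 6L, 5L)
)

# Assigning NULL to a column removes it from the data frame.
wine_data$X <- NULL

# List the remaining column names.
names(wine_data)
```

Assigning NULL modifies the data frame in place, so no copy with a new name is needed.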

Univariate Analysis

Let us check the distribution of each variable, starting with summary statistics:

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Wine Quality

After checking the ratings and variable distribution, I’ll create another categorical variable, classifying the wines as ‘bad’ (rating 0 to 4), ‘average’ (rating 5 or 6), and ‘good’ (rating 7 to 10).

##     bad average    good 
##      63    1319     217
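One way to build such a variable is with cut(), sketched below. The breaks follow the rating bands described above. The scores are illustrative values (one for each quality score from 3 to 8), not rows from the dataset, and the original analysis may have used a different approach such as ifelse().

```r
# Illustrative quality scores (assumed values, one per score from 3 to 8).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Bands: [0, 4] -> bad, (4, 6] -> average, (6, 10] -> good.
rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("bad", "average", "good"),
  include.lowest = TRUE
)

# Count how many wines fall into each band.
table(rating)
```

Because cut() returns a factor with the levels in the order given, tables and plots keep the bad, average, good ordering.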

Now we will visualize the distribution of each variable by plotting its histogram:

## Using rating as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the above plots, most of the variables are close to normally distributed, except “chlorides” and “residual sugar”. This seems to be due to outliers, which we can exclude before replotting the histograms.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 79 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 80 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

After excluding the outliers, the distributions of residual sugar and chlorides also look approximately normal.

What is the structure of your dataset?

There are 1599 observations and 13 variables (including X). X appears to be an index value for each observation. Also, quality is stored as an integer, and all other variables are numeric.

What is/are the main feature(s) of interest in your dataset?

I am interested in the quality ratings of red wine and in which variables influence them.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

As seen above, most of the red wines in the dataset have quality ratings of 5 or 6.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

For wines, the first two factors that come to mind from the given dataset are alcohol content and density. It would therefore be interesting to analyse the relationship between alcohol content and wine density. Let’s have a look at the relation between these two variables:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

This shows a right-skewed graph in which density is highest where the alcohol content is between 9 and 10%. The density decreases as the alcohol content increases.

Did you create any new variables from existing variables in the dataset?

Yes, I created a variable “rating” that groups wine quality into three categories (good, average, bad) to give a concise, summarized view of wine quality in graphs and figures.

Of the features you investigated, were there any unusual distributions?

Most of the variables are more or less normally distributed, except “chlorides” and “residual sugar”. This was due to outliers, which were handled by excluding values above the 95th percentile of these two variables and replotting the histograms.
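A minimal sketch of trimming at the 95th percentile, using base R's quantile(). The values are the first ten residual.sugar readings from the str() output earlier, so the cutoff here differs from the one for the full dataset. The variable names are illustrative, and the original analysis may have applied the limit inside the plotting call instead of subsetting.

```r
# First ten residual.sugar values from the str() output (a toy sample).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Value at or below which 95% of the observations fall.
cutoff <- quantile(residual_sugar, probs = 0.95)

# Keep only observations at or below the cutoff.
trimmed <- residual_sugar[residual_sugar <= cutoff]

# Histogram bin counts, computed without opening a graphics device.
hist(trimmed, plot = FALSE)$counts
```

In this toy sample only the reading of 6.1 lies above the cutoff, so it is the single value removed.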

Bivariate Analysis & Plots Section

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

To answer the above, let’s create a scatterplot matrix to check the correlations between these variables:

## Warning in ggscatmat(wine_data): Factor variables are omitted in plot
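The coefficients displayed in a scatterplot matrix can also be computed directly with base R's cor(). The sketch below uses only the first ten rows of three columns from the str() output, so its values will not match the full-dataset coefficients reported in this section. The name wine_sample is illustrative.

```r
# First ten rows of three columns, copied from the str() output (toy sample).
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine_sample)

# Correlation of each variable with quality, rounded to two decimals.
round(cor_matrix[, "quality"], 2)
```

Reading one column of the matrix, as in the last line, is a quick way to rank variables by their correlation with quality.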

Did you observe any interesting relationships between the other features?

Here are some interesting correlations derived from the above scatterplot matrix:

The following variables have the highest positive correlations with wine quality:

  • alcohol:quality = 0.48
  • sulphates:quality = 0.25
  • citric.acid:quality = 0.23
  • fixed.acidity:quality = 0.12

Here are the variables with the strongest negative correlations with wine quality:

  • volatile.acidity:quality = -0.39
  • total.sulfur.dioxide:quality = -0.19
  • density:quality = -0.17
  • chlorides:quality = -0.13

So we observe that volatile acidity is negatively correlated with the quality of red wine.

Here are the strongest positive and negative correlations in the scatterplot matrix:

  • fixed.acidity:citric.acid = 0.67
  • fixed.acidity:density = 0.67
  • free.sulfur.dioxide:total.sulfur.dioxide = 0.67
  • alcohol:quality = 0.48

  • fixed.acidity:pH = -0.68
  • volatile.acidity:citric.acid = -0.55
  • citric.acid:pH = -0.54
  • density:alcohol = -0.50

From the correlation matrix we created above, I think it would be interesting to analyze the interaction between the following variables. Let’s see how some of the important variables compare when plotted against each other for bad, average & good quality wines:

Interestingly, the above plots show that density and fixed acidity are positively correlated with each other, even though fixed acidity is positively correlated and density is negatively correlated with wine quality.

What was the strongest relationship you found?

The strongest relationship involving wine quality is with alcohol content (correlation 0.48). Wines with a higher alcohol content tend to receive higher quality ratings.

Multivariate Analysis & Plots Section

Were there any interesting or surprising interactions between features?

Let’s create plots to check some of the strong interactions between variables that, based on the correlation matrix, I think would be interesting to see for “good”, “average” & “bad” wines:

From the above graphs, I chose density, alcohol and sulphates to examine in detail for their influence on wine quality, using the following plots in this section.

Density, Alcohol & Wine Quality

As seen above, good quality wines have lower density and more alcohol. The graph above shows the kernel density: geom_density computes and draws a kernel density estimate, which is a smoothed version of the histogram.

Let’s create scatter plots to check the influence of density and alcohol on wine quality:
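To illustrate what a kernel density estimate is, the sketch below uses base R's density() rather than ggplot2, on the first ten alcohol values from the str() output. This is a toy sample, so the curve is not the one plotted above, and the variable names are illustrative.

```r
# First ten alcohol values from the str() output (a toy sample).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Gaussian kernel density estimate with the default bandwidth.
kde <- density(alcohol)

# The estimate is a curve evaluated on an evenly spaced grid:
# kde$x holds the grid points and kde$y the estimated density.
step <- diff(kde$x)[1]
area <- sum(kde$y) * step  # Riemann-sum approximation of the area

area
```

Like a histogram scaled to density, the curve encloses an area of approximately 1, which is why density plots of groups with very different sizes remain comparable.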

Density, Sulphates & Wine Quality

We see in the above graph that good quality wines have more sulphates and lower density. Here is the scatter plot to check the influence of density and sulphates on wine quality:

The above graph shows that good quality wines have lower density and more sulphates than bad and average quality wines.

Thus, from all the above observations, it appears that “Alcohol” and “Sulphates” are positively correlated with wine quality, while “Density” is negatively correlated with quality and is lower in good quality wines.

Final Plots and Summary

Plot One

In our final plots, let’s check how the acids in the dataset influence wine quality:

Fixed Acidity vs. Quality

## [1] "Median of fixed.acidity by quality:"
## wine_data$quality: 3
## [1] 7.5
## -------------------------------------------------------- 
## wine_data$quality: 4
## [1] 7.5
## -------------------------------------------------------- 
## wine_data$quality: 5
## [1] 7.8
## -------------------------------------------------------- 
## wine_data$quality: 6
## [1] 7.9
## -------------------------------------------------------- 
## wine_data$quality: 7
## [1] 8.8
## -------------------------------------------------------- 
## wine_data$quality: 8
## [1] 8.25

We can see that fixed acidity increases from the average quality rating (6) to the high quality rating (7). There is also a large dispersion of fixed acidity values across the scale, which indicates that fixed acidity cannot be the only factor behind good quality wine and that quality depends on other factors too.
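Per-group medians like those printed above can be produced with base R's by(). The sketch below is a guess at the shape of the call, run on only the first ten rows from the str() output, so the medians differ from the full-dataset values. The name wine_sample is illustrative.

```r
# First ten rows of two columns, copied from the str() output (toy sample).
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Apply median() to fixed.acidity within each quality score.
med_by_quality <- by(wine_sample$fixed.acidity, wine_sample$quality, median)

# Prints one block per quality score present in the sample.
med_by_quality
```

Passing summary instead of median to the same call gives the six-number summary for each group.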

Citric Acid vs. Quality

## [1] "Median of citric.acid by quality:"
## wine_data$quality: 3
## [1] 0.035
## -------------------------------------------------------- 
## wine_data$quality: 4
## [1] 0.09
## -------------------------------------------------------- 
## wine_data$quality: 5
## [1] 0.23
## -------------------------------------------------------- 
## wine_data$quality: 6
## [1] 0.26
## -------------------------------------------------------- 
## wine_data$quality: 7
## [1] 0.4
## -------------------------------------------------------- 
## wine_data$quality: 8
## [1] 0.42

We see that citric acid is higher for good quality ratings, which suggests that the higher the citric acid, the higher the quality of the wine, although of course the quantity should be measured and not excessive.

Volatile Acidity vs. Quality

## [1] "Median of volatile.acidity by quality:"
## wine_data$quality: 3
## [1] 0.845
## -------------------------------------------------------- 
## wine_data$quality: 4
## [1] 0.67
## -------------------------------------------------------- 
## wine_data$quality: 5
## [1] 0.58
## -------------------------------------------------------- 
## wine_data$quality: 6
## [1] 0.49
## -------------------------------------------------------- 
## wine_data$quality: 7
## [1] 0.37
## -------------------------------------------------------- 
## wine_data$quality: 8
## [1] 0.37

Lower volatile acidity seems to mean higher wine quality, as reflected in the correlation matrix, i.e. volatile.acidity:quality = -0.39.

pH vs. Quality

## [1] "Median of pH by quality:"
## wine_data$quality: 3
## [1] 3.39
## -------------------------------------------------------- 
## wine_data$quality: 4
## [1] 3.37
## -------------------------------------------------------- 
## wine_data$quality: 5
## [1] 3.3
## -------------------------------------------------------- 
## wine_data$quality: 6
## [1] 3.32
## -------------------------------------------------------- 
## wine_data$quality: 7
## [1] 3.28
## -------------------------------------------------------- 
## wine_data$quality: 8
## [1] 3.23

In the above graph we see that pH tends to be lower for higher quality wines.

Description One

From the above plots we can conclude that:

  • Volatile (acetic) acid negatively affects wine quality.
  • Higher levels of the other acids (or lower pH) are seen in highly rated wines.
  • Citric acid has a higher concentration in good quality wines, and fixed (tartaric) acidity is also higher and influences wine quality positively.

Plot Two

Alcohol and Sulphates influence on quality of wine

We will create plots to check the correlation of Alcohol and Sulphates for wines with given quality ratings.

  • Sulphates
## wine_data$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4950  0.5600  0.5922  0.6000  2.0000 
## -------------------------------------------------------- 
## wine_data$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5400  0.6100  0.6473  0.7000  1.9800 
## -------------------------------------------------------- 
## wine_data$rating: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7435  0.8200  1.3600
  • Alcohol
## wine_data$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## wine_data$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## wine_data$rating: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00

Description Two

As seen above, good wines have a mean alcohol content of about 11.52 (% by volume) and a mean sulphates level of about 0.7435 g/dm3 (potassium sulphate). These plots show that higher alcohol and sulphates levels are associated with better wines.

Plot Three

Description Three

These boxplots show the effect of alcohol content on wine quality. Higher alcohol content is correlated with higher wine quality. It is also worth noticing that alcohol content alone does not produce a higher quality, as shown by the outliers and intervals.


Reflection

This was an interesting dataset to explore using R and EDA techniques. I focused on finding which variables determine the quality of red wine. I checked the dataset and removed some outliers found in the histograms of a couple of variables to get more precise results. I chose variables based on their correlation coefficients to draw out the relations between them and to determine their influence on wine quality.

After all the analysis, it can be concluded that the major factors determining wine quality are alcohol, acidity and sulphates. Wine quality is positively correlated with alcohol, sulphates and acids (except volatile acids), so good quality wines are rich in these factors. There is a negative correlation between pH and wine quality. Sulfur dioxide & residual sugar do not seem to have much impact on the quality of the wines.

It was interesting to see that fixed acidity and density are positively correlated with each other, even though fixed acidity is positively correlated and density is negatively correlated with wine quality. It would be even more interesting to add other factors, such as aging & wine brands, in future analysis.

I struggled to choose the most appropriate graph and to decide which variables were strongest to compare with each other in a given context. I created and used the correlation matrix to write out a list of variable comparisons and applicable graphs at my disposal, and determined the strengths and weaknesses of each. This made it easy for me to choose different plots for the various factor combinations.